Use Babashka to download a landing page
Introduction
Recently I got a junior job on Upwork. The client wants me to copy some HTML templates, make a few changes such as replacing images and text, and then submit HTML + CSS + image files, with no scripts.
I manually copied and pasted a few and thought: I'm a fucking software engineer, or at least an engineer, not a textile factory worker. Writing a script to do this is the right way.
I am very familiar with JavaScript, so using NodeJS with Cheerio and htmlparser2 would have been easier.
But I had seen a job that asked for Babashka for API queries, and I know it a bit from Babashka Babooka.
So I decided to develop the script with my poor Clojure knowledge.
Goals
- Create the basic file layout, like:
./demo
├── css
│ ├── style.css
│ └── ...other CSS files that need downloading
├── imgs
│ ├── img1.png
│ ├── img2.png
│ └── ...imgs
├── index.html
└── original.html
- Download the HTML to original.html unedited, just for checking.
- Download the HTML to index.html, then apply the next steps to it.
- Remove all the script elements.
- Remove all the link elements (optional).
- Move all the style contents into a newly created file, style.css.
- Download all the images to the imgs directory, and change each src to a relative path like ./imgs/img1.png.
Preparation
- An environment for running Babashka.
- Choose the dependencies I might need. Here is the code:
(require
'[babashka.pods :as pods]
'[babashka.http-client :as http]
'[babashka.fs :as fs]
'[clojure.string :as str])
(pods/load-pod 'retrogradeorbit/bootleg "0.1.9")
(require '[pod.retrogradeorbit.bootleg.utils :as bootleg]
'[pod.retrogradeorbit.hickory.select :as s])
Dependencies
babashka.http-client, babashka.fs, and clojure.string are easy to understand; you can guess what each does from its name.
babashka.pods is for loading extra dependencies (pods) that are available to Babashka.
I need bootleg to transform the HTML string into hickory, a parsed-HTML data format.
The hickory select namespace supplies the select function to pick out the elements I want; the alternative is to just recurse through the hickory content.
Development
Let's finish the goals one by one.
Create default files
(defn- create-default-files [dir-name]
(let [css-dir (str dir-name "/css")
imgs-dir (str dir-name "/imgs")
basic-css-file (str dir-name "/css/style.css")
original-html-file (str dir-name "/original.html")
submit-html-file (str dir-name "/index.html")]
(when (fs/exists? dir-name)
(fs/delete-tree dir-name))
(fs/create-dir dir-name)
(fs/create-dir css-dir)
(fs/create-dir imgs-dir)
(fs/create-file basic-css-file)
(fs/create-file original-html-file)
(fs/create-file submit-html-file)))
Download html
(defn- fetch-url [url]
(let [response (http/get url)]
(:body response)))
Download and write
Test this in your REPL or in a comment block:
(def testname "test")
(create-default-files testname)
(fs/write-lines
(str testname "/original.html")
[(fetch-url "https://digitalwebrocket.com/rocketpack/")])
You have now downloaded the HTML content to test/original.html.
Handle the HTML
- Convert the HTML to a hickory map.
- Recurse through the map to apply the handling for each goal:
(defn recur-hick [data]
  (if (map? data)
    ;; element node: rebuild it with recursively handled children
    (assoc data
           :content
           ;; handle the content here
           (mapv recur-hick (:content data)))
    ;; text node: print it and return it unchanged
    (do
      (prn data)
      data)))
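To make the shape concrete: hickory represents each element as a map with :type, :tag, :attrs and :content keys, and text nodes as plain strings. The sketch below runs without the pod. `sample-page` is a hand-written stand-in, not real parser output, and `walk-hick`, `clear-script` and `tags-in` are hypothetical helper names used only for illustration.

```clojure
;; A hand-written hickory-style tree (an assumption, not output from bootleg).
(def sample-page
  {:type :document
   :content
   [{:type :element :tag :html :attrs nil
     :content
     [{:type :element :tag :head :attrs nil
       :content [{:type :element :tag :style :attrs nil
                  :content ["body { margin: 0; }"]}]}
      {:type :element :tag :body :attrs nil
       :content [{:type :element :tag :img
                  :attrs {:src "//cdn.example.com/a.png"} :content nil}
                 "some text"
                 {:type :element :tag :script :attrs nil
                  :content ["alert(1)"]}]}]}]})

(defn walk-hick
  "Applies f to every map node, top down, and rebuilds the tree.
  Strings (text nodes) are returned unchanged."
  [f node]
  (if (map? node)
    (let [handled (f node)]
      (assoc handled :content (mapv #(walk-hick f %) (:content handled))))
    node))

(defn clear-script
  "Empties the content of script elements and leaves other nodes alone."
  [node]
  (if (= :script (:tag node))
    (assoc node :content [])
    node))

(defn tags-in
  "Returns the tags of all element nodes in document order."
  [tree]
  (->> (tree-seq map? :content tree)
       (filter map?)
       (keep :tag)
       vec))

(tags-in sample-page)
;; => [:html :head :style :body :img :script]

(walk-hick clear-script sample-page)
;; same tree, but the script element's :content is now []
```

Passing the handler in as an argument keeps the traversal separate from the per-tag decisions, which is the same split the script makes between the recursion and the element handler.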
- Write the handler functions:
(defn- download-img
"down load img to imgs-dir"
[url imgs-dir]
(let [img-name (last (str/split url #"/"))
img-path (str imgs-dir "/" img-name)]
(prn "downloading..." url)
(try
(fs/write-bytes
img-path
(-> url
(http/get {:as :bytes})
:body))
(catch Exception e (prn "error when download-img" url (ex-message e))))))
(defn handle-html-element [data]
(let [{:keys [tag attrs type content]} data]
(case tag
;; download the image, then return the element itself so the tree stays intact
:img (do (download-img (get attrs :src) "test/imgs") data)
:style (do
;; write the content to style.css
(fs/write-lines "test/css/style.css" content {:append true})
(assoc data :content []))
:script (assoc data :content [])
;; add the link of the css file
:head (update data :content
#(conj % {:type :element
:tag :link
:attrs {:rel "stylesheet"
:href "./css/style.css"}}))
data)))
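One thing to watch in download-img above: the file name is simply the last "/"-separated segment of the URL, so a src that carries a query string would keep it in the file name. A minimal sketch of one way around that, assuming a hypothetical helper named `url->file-name` (not part of the original script):

```clojure
(require '[clojure.string :as str])

(defn url->file-name
  "Returns the last path segment of url, without any ?query or #fragment part."
  [url]
  (-> url
      (str/split #"[?#]") ; cut off the query string and fragment
      first
      (str/split #"/")
      last))

(url->file-name "https://example.com/assets/hero.png?v=3")
;; => "hero.png"
(url->file-name "//cdn.example.com/img/a.jpg")
;; => "a.jpg"
```

The example.com URLs are placeholders. The same helper could be used when rewriting the src attribute, so the downloaded file and the relative path stay in sync.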
- Add a main function that receives the script parameters:
(defn main []
(let [[url & args] *command-line-args*
;; take 'dir-name' from the script args, or fall back to the last segment of the url
dir-name (or (first args)
(-> url
(str/split #"/")
last))]))
(main)
You can now run the script like this:
bb ./download_html.clj "https://digitalwebrocket.com/rocketpack/"
=> url = https://digitalwebrocket.com/rocketpack/, dir-name = rocketpack
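The directory name comes out as rocketpack even though the URL ends with a slash, because clojure.string/split drops trailing empty strings. A small sketch of just that step, with `default-dir-name` as a hypothetical name for the expression used in main:

```clojure
(require '[clojure.string :as str])

(defn default-dir-name
  "Last non-empty '/'-separated segment of url."
  [url]
  (-> url
      (str/split #"/")
      last))

(str/split "https://digitalwebrocket.com/rocketpack/" #"/")
;; => ["https:" "" "digitalwebrocket.com" "rocketpack"]

(default-dir-name "https://digitalwebrocket.com/rocketpack/")
;; => "rocketpack"

(default-dir-name "https://digitalwebrocket.com/")
;; => "digitalwebrocket.com"
```

For a URL with no path, the fallback is the host name, so passing an explicit directory name as the second argument may be preferable in that case.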
Final Code
(ns download-html
(:require
[babashka.pods :as pods]
[babashka.http-client :as http]
[babashka.fs :as fs]
[clojure.string :as str]))
(pods/load-pod 'retrogradeorbit/bootleg "0.1.9")
(require '[pod.retrogradeorbit.bootleg.utils :as bootleg]
'[pod.retrogradeorbit.hickory.select :as s])
(defn- fetch-url
"HTTP request to get the HTML content"
[url]
(let [response (http/get url)]
(:body response)))
(defn- download-img
"download the image at url into imgs-dir"
[url imgs-dir]
(let [img-name (last (str/split url #"/"))
img-path (str imgs-dir "/" img-name)]
(prn "downloading..." url "to" img-path)
(try
(fs/write-bytes
img-path
(-> url
(http/get {:as :bytes})
:body))
(catch Exception e (prn "error when download-img")))))
(defn- fix-url
"if the url begins with '//', prepend 'https:'"
[url]
(if (str/starts-with? url "//")
(str "https:" url)
url))
(defn- handle-html-element
"handle the img/style/script/head element"
[data imgs-dir css-file-path]
(let [{:keys [tag attrs type content]} data]
(case tag
:img (let [url (:src attrs)
new-attr (-> data
(get :attrs)
(dissoc :srcset)
(assoc :src
(str "./imgs/"
(last (str/split url #"/")))))]
;; some src values begin with '//' and lack the 'https:' scheme
(download-img (fix-url url) imgs-dir)
;; remove srcset and change src to a relative path
(assoc data :attrs new-attr))
:style (do
(when (vector? content)
(fs/write-lines css-file-path content {:append true}))
(assoc data :content []))
:script (-> data
(assoc :content [])
(assoc :attrs nil))
:head (update data :content
#(conj % {:type :element
:tag :link
:attrs {:rel "stylesheet"
:href "./css/style.css"}}))
data)))
(defn- recur-hick
"recurse through the hickory data"
[html-data imgs-dir css-file-path]
(if (map? html-data)
(let [handled-data (handle-html-element html-data imgs-dir css-file-path)]
(-> handled-data
(assoc
:content
(mapv #(recur-hick % imgs-dir css-file-path)
(:content handled-data)))))
(do
#_(prn "xxx" html-data)
html-data)))
(defn main
"the script main function"
[]
(let [[url & args] *command-line-args*
dir-name (or (first args)
(-> url
(str/split #"/")
last))
css-dir (str dir-name "/css")
imgs-dir (str dir-name "/imgs")
basic-css-file (str dir-name "/css/style.css")
original-html-file (str dir-name "/original.html")
submit-html-file (str dir-name "/index.html")
html (-> url
fetch-url
str/trim)]
(when (fs/exists? dir-name)
(fs/delete-tree dir-name))
(fs/create-dir dir-name)
(fs/create-dir css-dir)
(fs/create-dir imgs-dir)
(fs/create-file basic-css-file)
(fs/create-file original-html-file)
(fs/create-file submit-html-file)
(fs/write-lines original-html-file [html])
(fs/write-lines submit-html-file
[(-> html
(bootleg/convert-to :hickory)
(recur-hick imgs-dir basic-css-file)
(bootleg/convert-to :html))])))
(main)
Improvement Todos
- Remove all the links
- Download all the CSS files, and handle the case where the href starts with "//"
- Download all the JS files, and handle the case where the src starts with "//"
- Download all the font sources
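For the "//" items above, one possible approach is to resolve every reference against the page URL with java.net.URI instead of prepending "https:" by hand. This is only a sketch: `absolute-url` is a hypothetical helper, and the cdn.example.com and other.example hosts are placeholders.

```clojure
(import 'java.net.URI)

(defn absolute-url
  "Resolves ref (absolute, protocol-relative, root-relative, or relative)
  against page-url and returns the result as a string."
  [page-url ref]
  (str (.resolve (URI. page-url) ref)))

(def page "https://digitalwebrocket.com/rocketpack/")

(absolute-url page "//cdn.example.com/font.woff2")
;; => "https://cdn.example.com/font.woff2"
(absolute-url page "css/main.css")
;; => "https://digitalwebrocket.com/rocketpack/css/main.css"
(absolute-url page "/js/app.js")
;; => "https://digitalwebrocket.com/js/app.js"
(absolute-url page "https://other.example/x.css")
;; => "https://other.example/x.css"
```

This also covers relative paths, which the "https:" prefix fix does not. A reference containing characters that are illegal in a URI, such as spaces, would throw, so real pages may need extra cleanup first.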